2.5 Exponential families. These will be families $\{P_\theta,\ \theta \in \Theta\}$ of laws, including many
of the best-known special families such as the binomial and normal laws, and for which
there is a natural vector-valued sufficient statistic, whose dimension stays constant as the
sample size $n$ increases, and which has the Lehmann-Scheffé property.

Definition. A family $\mathcal{P} = \{Q_\psi : \psi \in \Psi\}$ of laws on a measurable space $(X, \mathcal{B})$, containing
at least two different laws, is called an exponential family if there exist a $\sigma$-finite measure
$\mu$ on $(X, \mathcal{B})$, a positive integer $k$, and real functions $\theta_j$ on $\Psi$ and measurable $h$ with
$0 < h(x) < \infty$ and $T_j$ on $X$ for $j = 1, \dots, k$, such that for all $\psi \in \Psi$, $Q_\psi$ is absolutely continuous
with respect to $\mu$, and for some $C(\theta(\psi)) > 0$, where $\theta(\psi) := (\theta_1(\psi), \dots, \theta_k(\psi))$,

(2.5.1) $(dQ_\psi/d\mu)(x) = C(\theta(\psi))\,h(x)\exp\big(\sum_{j=1}^k \theta_j(\psi)T_j(x)\big).$

If we replace $\mu$ by $\nu$ where $d\nu(x) = h(x)\,d\mu(x)$, the factor $h$ can be omitted, and $\nu$
is still a $\sigma$-finite measure. Given the $\theta_j$, $T_j$, $h$, and $\mu$, the number $C(\theta(\psi))$ is determined
by normalization, so it is, in fact, a function of $\theta(\psi) := \{\theta_j(\psi)\}_{j=1}^k$. Thus, given $T_j$, $h$,
and $\mu$, $Q_\psi$ is determined by the values of $\theta_j(\psi)$.
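
As an illustration of (2.5.1), here is a worked instance for the binomial laws mentioned at the start of the section. The choices of $\mu$, $h$, $T_1$ and $\theta_1$ below are one standard way to write the binomial family in this form; they are given as an example and are not taken from the text.

```latex
% Binomial(n, p), 0 < p < 1, with \mu = counting measure on \{0, 1, \dots, n\}:
\frac{dQ_p}{d\mu}(x) = \binom{n}{x} p^x (1-p)^{n-x}
  = (1-p)^n \binom{n}{x} \exp\!\Big( x \log\frac{p}{1-p} \Big),
% so k = 1, C = (1-p)^n, h(x) = \binom{n}{x}, T_1(x) = x, \theta_1(p) = \log(p/(1-p)).
```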

It follows from the factorization theorem, Corollary 2.1.5, that for any exponential
family, the vector-valued statistic $(T_1(x), \dots, T_k(x))$ is a sufficient statistic. The structure
of an exponential family is essentially preserved by taking $n$ i.i.d. observations, as
follows. Let $\{Q_\psi,\ \psi \in \Psi\}$ be any exponential family and let $X_1, \dots, X_n$ be i.i.d. $(Q_\psi)$.
Then the distribution $Q_\psi^n$ of $(X_1, \dots, X_n)$ is an exponential family for the $\sigma$-finite measure
$\mu^n$ on $X^n$, replacing $T_j(x)$ by $\sum_{i=1}^n T_j(X_i)$, $h(x)$ by $\prod_{j=1}^n h(x_j)$, and $C(\theta(\psi))$ by
$C(\theta(\psi))^n$. It follows that for $n$ i.i.d. observations from the exponential family, the $k$-vector
$\{\sum_{i=1}^n T_j(X_i)\}_{j=1}^k$ is a sufficient statistic.
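
The sufficiency of the summed statistic can be checked numerically in a simple case. The sketch below assumes the Poisson family as an illustration (with $T(x) = x$, $h(x) = 1/x!$ and $\theta = \log\lambda$; this choice of family and all function names are assumptions, not part of the text). Two samples with the same value of $\sum_i T(X_i)$ should have a likelihood ratio that does not depend on the parameter.

```python
import math

def poisson_log_likelihood(sample, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in sample)

def log_likelihood_ratio(sample_a, sample_b, lam):
    """Log of the likelihood ratio between two samples at parameter lam."""
    return poisson_log_likelihood(sample_a, lam) - poisson_log_likelihood(sample_b, lam)

# Two hypothetical samples of size 3 with the same sum, 8.
sample_a = [0, 3, 5]
sample_b = [2, 2, 4]

# The ratio is evaluated at several parameter values; only the h-factors
# (the factorials) survive, so the values should agree up to rounding.
ratios = [log_likelihood_ratio(sample_a, sample_b, lam) for lam in (0.5, 1.0, 3.0, 10.0)]
```

The common value of the log ratio is the log of the ratio of the $h$ factors, here $\log\big((2!\,2!\,4!)/(0!\,3!\,5!)\big)$.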

Since exponentials are strictly positive, any exponential family is an equivalent family
as defined in the last section. The $T_j$ will be called affinely dependent if for some constants
$c_0, c_1, \dots, c_k$, not all 0, $c_0 + c_1T_1 + \cdots + c_kT_k = 0$ almost everywhere for $\mu$. Then $c_i \neq 0$
for some $i \geq 1$, and we can solve for $T_i$ as a linear combination of other $T_j$ and a constant.
Then we can eliminate the $T_i$ term and reduce $k$ by 1, adding constants times $\theta_i$ to each
$\theta_j$ for $j \neq i$. Iterating this, we can assume that $T_1, \dots, T_k$ are affinely independent,
i.e. they are not affinely dependent. Likewise, we can define affine independence for the
functions $\theta_j$, where now the linear relations among the $\theta_j(\cdot)$ and a constant would hold
everywhere rather than almost everywhere (at this point we are not assuming a prior
given on the parameter space $\Psi$). We can eliminate terms until the $\theta_j(\cdot)$ are also affinely
independent. We will always still have $k \geq 1$ since $\mathcal{P}$ contains at least two laws.

Let $\Theta$ be the range of the function $\psi \mapsto \theta(\psi) := (\theta_1(\psi), \dots, \theta_k(\psi))$ from $\Psi$ into $\mathbb{R}^k$.
Then clearly $\theta_1(\cdot), \dots, \theta_k(\cdot)$ are affinely independent if and only if $\Theta$ is not included in any
$(k-1)$-dimensional hyperplane in $\mathbb{R}^k$. Likewise, $T_1, \dots, T_k$ are affinely independent (as
defined above) if and only if for $T := (T_1, \dots, T_k)$ from $X$ into $\mathbb{R}^k$, the measure $\mu \circ T^{-1}$
is not concentrated in any $(k-1)$-dimensional hyperplane in $\mathbb{R}^k$. For each $\theta \in \Theta$, let $P_\theta$
be the law on $X$ with $(dP_\theta/d\mu)(x) = C(\theta)h(x)e^{\theta\cdot T(x)}$ where $\theta \cdot T := \sum_{j=1}^k \theta_jT_j$. Then
$Q_\psi = P_{\theta(\psi)}$ for all $\psi \in \Psi$ and $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

A representation (2.5.1) of an exponential family will be called minimal if $T_1, \dots, T_k$
are affinely independent, as are $\theta_1(\cdot), \dots, \theta_k(\cdot)$.

A functionoid is an equivalence class of functions for the relation of almost sure equality
for a measure. The well-known Banach spaces $L^p$ of $p$-integrable functions, such as the
Hilbert spaces $L^2$ of square-integrable functions, are actually spaces of functionoids. For
an exponential family, or any other equivalent family, almost sure equality is the same for
$P_\theta$ for all $\theta$.

2.5.2 Theorem. Every exponential family $\mathcal{P} := \{Q_\psi : \psi \in \Psi\}$ has a minimal representation
(2.5.1), and then $k$ is uniquely determined.

Proof. We already saw that the $T_j(\cdot)$ can be taken to be affinely independent, as can the
$\theta_j(\cdot)$, so that the representation (2.5.1) is minimal. As we also saw, the family $\mathcal{P}$ can be
written as $\{P_\theta,\ \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^k$, where

$(dP_\theta/d\mu)(x) = C(\theta)h(x)e^{\theta\cdot T(x)}.$

Then the likelihood ratios are all of the form

$R_{\theta,\phi} := R_{P_\theta/P_\phi} = C(\theta)C(\phi)^{-1}\exp\{\sum_{j=1}^k (\theta_j - \phi_j)T_j(x)\}.$

The logarithms of these likelihood ratios (log likelihood ratios) plus constants span a real
vector space $V_T$ of functionoids on $X$, included in the vector space $W_T$ of functionoids
spanned by $1, T_1, \dots, T_k$. Then $W_T$ is $(k+1)$-dimensional since $T_1, \dots, T_k$ are affinely
independent by minimality. Also, since $\theta_1, \dots, \theta_k$ are affinely independent on $\Theta$, $V_T = W_T$.
Now $V := V_T$ is determined by the family $\mathcal{P}$, not depending on the choice of $\mu$ or $T$, so
$V$ and $k$ are uniquely defined for the family $\mathcal{P}$. □

The number k will be called the order of the exponential family. From here on it 
will be assumed that the representation of an exponential family is minimal unless it is 
specifically said not to be. 

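Any exponential family $\mathcal{P}$ can be parameterized by a subset $\Theta$ of $\mathbb{R}^k$, replacing $\theta_j(\psi)$ by
$\theta_j$, with $\Theta = \{\theta(\psi) : \psi \in \Psi\}$, and

(2.5.3) $(dP_\theta/d\mu)(x) = C(\theta)h(x)\exp\big(\sum_{j=1}^k \theta_jT_j(x)\big), \quad \theta \in \Theta \subset \mathbb{R}^k,$

where now $Q_\psi = P_{\theta(\psi)}$ for all $\psi \in \Psi$. The parameterization in (2.5.3) is then one-to-one: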

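2.5.4 Theorem. If an exponential family has a minimal representation (2.5.3), then for
any $\theta \neq \phi$ in $\Theta$, $P_\theta \neq P_\phi$.

Proof. If $P_\theta = P_\phi$, then for $\theta \cdot T := \sum_j \theta_jT_j$, taking logarithms of the densities in (2.5.3) gives, almost everywhere,

$\theta \cdot T + \log C(\theta) = \phi \cdot T + \log C(\phi),$

or $(\theta - \phi) \cdot T = c$ for some $c$ not depending on $x$. But $\theta \neq \phi$ means that the $T_j$ are affinely
dependent, contradicting minimality. □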

Any subset of an exponential family is also an exponential family with the same $T_j$
and $\nu$, recalling that $d\nu(x) := h(x)\,d\mu(x)$. It can be useful to take an exponential family
as large as possible. Given $\nu$ and $T_j$, $j = 1, \dots, k$, the natural parameter space of the
exponential family is the set of all $\theta = (\theta_1, \dots, \theta_k) \in \mathbb{R}^k$ such that

(2.5.5) $K(\theta) := \int \exp\big(\sum_{j=1}^k \theta_jT_j(x)\big)\,d\nu(x) < \infty.$

Clearly $K(\theta) > 0$ for all $\theta$. For any $\theta$ in the natural parameter space, we can define
$C(\theta) := 1/K(\theta)$ and get a probability measure $P_\theta$ given by (2.5.3). So we have a family
of laws $P_\theta$ indexed by the natural parameter space. The family doesn't extend to values
of $\theta$ outside the natural parameter space since then normalization is not possible.
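
A small numerical sketch may make (2.5.5) concrete. It assumes, purely as an illustration, that $\nu$ is counting measure on $\{0, 1, 2, \dots\}$ and $T(x) = x$ (the geometric family); then $K(\theta) = \sum_x e^{\theta x} = 1/(1 - e^{\theta})$ for $\theta < 0$ and $K(\theta) = \infty$ for $\theta \geq 0$, so the natural parameter space is $(-\infty, 0)$. The function names are hypothetical.

```python
import math

def partial_K(theta, n_terms):
    """Partial sum of K(theta) = sum_{x >= 0} exp(theta * x) over x < n_terms."""
    return sum(math.exp(theta * x) for x in range(n_terms))

def K_closed_form(theta):
    """Closed form of K for this illustrative family; infinite outside (-inf, 0)."""
    if theta >= 0:
        return math.inf
    return 1.0 / (1.0 - math.exp(theta))

# Inside the natural parameter space the partial sums settle down ...
inside = partial_K(-0.5, 200)
# ... while at theta = 0 they grow without bound (each term equals 1).
boundary = partial_K(0.0, 1000)
```

For $\theta < 0$, $C(\theta) = 1/K(\theta) = 1 - e^{\theta}$ is the usual normalizing constant of the geometric law on $\{0, 1, 2, \dots\}$.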

2.5.6 Theorem. For any given $\sigma$-finite $\nu$ and measurable functions $T_j$ on $(X, \mathcal{B})$, the
natural parameter space is a convex set in $\mathbb{R}^k$.

Proof. First, for any real $y$ (which can be positive or negative), $\theta \mapsto e^{y\theta}$ is a convex
function of $\theta \in \mathbb{R}$ (its second derivative is nonnegative, so its first derivative is nondecreasing,
which implies convexity). It follows that for any real $y_1, \dots, y_k$, the function

$\theta = (\theta_1, \dots, \theta_k) \mapsto \exp(y_1\theta_1 + \cdots + y_k\theta_k)$

is convex on $\mathbb{R}^k$. The inequalities defining convexity are preserved when integrated with
respect to a nonnegative measure, so $K(\theta)$ is a convex function, whose values may be
infinite for some $\theta$ (just those $\theta$ outside the natural parameter space). The set where a
convex function is $< +\infty$ is clearly a convex set. □

2.5.7 Proposition. For any exponential family, the natural parameter space is the same
for any number $n$ of i.i.d. observations.

Proof. If $K_n(\theta)$ is the integral $K(\theta)$ for $n$ observations, then from the definitions and the
Tonelli-Fubini theorem, $K_n(\theta) = K_1(\theta)^n$ for all $n$, so $K_n(\theta)$ is finite if and only if $K_1(\theta)$
is. □



2.5.8 Theorem. For an exponential family as in (2.5.3) let $U$ be the interior of the
natural parameter space. Then for $\xi = (\xi_1, \dots, \xi_k)$ in $U$ and $\eta = (\eta_1, \dots, \eta_k) \in \mathbb{R}^k$,
let $W := \{\zeta = \xi + i\eta : \xi \in U,\ \eta \in \mathbb{R}^k\}$ so that $\zeta_j = \xi_j + i\eta_j$ for $j = 1, \dots, k$.
Then the function $K(z)$ in (2.5.5) is, on $W$, an analytic (holomorphic) function of $z$,
representable by a power series in the $k$ coordinates $z_j - \zeta_j$ in the neighborhood of any
point $\zeta$ in $W$. In particular $K$ has, on $W$, continuous partial derivatives of all orders
with respect to $z$, which can be obtained by differentiating under the integral sign. In
other words, for any $p = (p_1, \dots, p_k)$, where the $p(i) := p_i$ are nonnegative integers and
$[p] := p_1 + \cdots + p_k$, the partial derivative $D^pK := \partial^{[p]}K(z)/\partial z_1^{p(1)} \cdots \partial z_k^{p(k)}$ exists and
is continuous, and equals $\int T(x)^p \exp\big(\sum_{j=1}^k z_jT_j(x)\big)\,d\nu(x)$, where $t^p := t_1^{p(1)} \cdots t_k^{p(k)}$. For
any $\xi \in U$, $E_\xi(T^p) = D^pK(\xi)/K(\xi)$.

Proof. Let $\zeta = \xi + i\eta \in W$, so $\xi \in U$ and $\eta \in \mathbb{R}^k$. Take $\varepsilon > 0$ small enough so that if
$|u_j - \xi_j| < \varepsilon$ for all $j = 1, \dots, k$, then $u \in U$, so $u + iv \in W$ for any $v \in \mathbb{R}^k$. Then for any
$T = T(x) \in \mathbb{R}^k$,

$|e^{(u+iv)\cdot T}| = e^{u\cdot T} = e^{(u-\xi)\cdot T}e^{\xi\cdot T}.$

Thus, replacing $d\nu(x)$ by $e^{\xi\cdot T(x)}\,d\nu(x)$, we can assume that $\xi = 0$. Then $|u_j| < \varepsilon$ for
$j = 1, \dots, k$.

We have $e^{u\cdot T} = \prod_{j=1}^k \exp(u_jT_j)$,

$\exp(u_1T_1) = \sum_{r=0}^\infty (u_1T_1)^r/r!$, $\quad |(u_1T_1)^r| = |u_1T_1|^r$, and
$\sum_{r=0}^\infty |u_1T_1|^r/r! = \exp(|u_1T_1|) \leq \exp(-\varepsilon T_1) + \exp(\varepsilon T_1),$

and likewise for any $j = 2, \dots, k$ in place of $j = 1$. By choice of $\varepsilon$,

$\int \cdots \int \prod_{j=1}^k \exp(\pm\varepsilon T_j)\,d\nu(x_1) \cdots d\nu(x_k) < \infty$

for any choices of $\pm$, where $T_j := T_j(x_j)$ for each $j$, so the sum over all $2^k$ possible choices
of $\pm$ of the integrals is finite. Thus by dominated convergence, the series

$e^{u\cdot T} = \prod_{j=1}^k \sum_{r_j=0}^\infty (u_jT_j)^{r_j}/r_j!$

converges absolutely if $|u_j| < \varepsilon$ for all $j$, and can be interchanged with

$\int \cdots \int \cdot\; d\nu(x_1) \cdots d\nu(x_k).$

The integral yields a power series in $u_1, \dots, u_k$. In the above, $u_j$ can be replaced by $u_j + iv_j$
if $|u_j + iv_j| < \varepsilon$ for each $j$. So we get a power series converging to $K(z)$ for $z = u + iv$.
Since such a series exists in some neighborhood of each point in $W$, $K(\cdot)$ is holomorphic
on $W$ as stated.

To show that derivatives can be taken under the integral sign, first let $k = 1$ and
$p = 1$. If $0 < t < c$ and $y > 0$ then for $\lambda := t/c$ and $x := cy$, by convexity
$e^{\lambda x} \leq \lambda e^x + (1 - \lambda)e^0 \leq \lambda e^x + 1$, so $(e^{ty} - 1)/t \leq e^{cy}/c$. Likewise, for $0 < |t| < c$ and all $y$,
$|(e^{ty} - 1)/t| \leq (e^{cy} + e^{-cy})/c$. For $u$ in $U$, and $c$ small enough, $u \pm c \in U$, so the functions
$\{(e^{(t+u)T(x)} - e^{uT(x)})/t : 0 < |t| < c\}$ are dominated by an integrable function. So

$\frac{d}{d\theta}\int e^{\theta T(x)}\,d\nu(x)\Big|_{\theta=u} = \int T(x)e^{uT(x)}\,d\nu(x).$

Also, $|y| \leq (e^{cy} + e^{-cy})/c$.

For $p > 1$, $c^{-p}(e^{cy} + e^{-cy})^p \leq (2/c)^p(e^{pcy} + e^{-pcy})$. For fixed $p$, $u \pm pc \in U$ for $c$
small enough, so we can again apply dominated convergence to get

$(d^p/d\theta^p)\int e^{\theta T(x)}\,d\nu(x)\Big|_{\theta=u} = \int T(x)^pe^{uT(x)}\,d\nu(x).$

Now for $k > 1$, and any $p \in \mathbb{N}^k$, the $2^k$ (or fewer) points $(u_1 \pm cp_1, \dots, u_k \pm cp_k)$ are all in
$U$ if $c$ is small enough. Dominated convergence applies once more, so the derivatives can
be interchanged with integrals as stated.

The final statement follows easily since $C(\theta) = 1/K(\theta)$, finishing the proof. □

Suppose given an exponential family as in (2.5.3) and let $j(\theta) := \log K(\theta) =
-\log C(\theta)$, so that $dP_\theta/d\nu = \exp(-j(\theta) + \theta \cdot T)$ where $\theta \cdot T := \sum_j \theta_jT_j(x)$. Since the
vector $T := \{T_j\}_{j=1}^k$ gives a sufficient statistic for the family, the means and variances of
its components are of interest. They have nice expressions in terms of derivatives of the
function $j$. The gradient of $j$ is the vector-valued function $\nabla j := (\partial j/\partial\theta_1, \dots, \partial j/\partial\theta_k)$.

2.5.9 Corollary. For any $\theta$ in the interior of the natural parameter space of the exponential
family (2.5.3), $E_\theta T = \nabla j(\theta)$ and for any $r, s = 1, \dots, k$,

$\mathrm{cov}_\theta(T_r, T_s) = E_\theta(T_rT_s) - E_\theta T_r\,E_\theta T_s = \partial^2j(\theta)/\partial\theta_r\partial\theta_s.$

Proof. Theorem 2.5.8 gives $E_\theta(T_rT_s) = (\partial^2K/\partial\theta_r\partial\theta_s)/K(\theta)$ and

$E_\theta T_r = (\partial K/\partial\theta_r)/K(\theta) = \partial j(\theta)/\partial\theta_r.$

This gives the first conclusion. Taking $\partial/\partial\theta_s$ of both sides of the last equation, the latter
conclusion follows. □
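
Corollary 2.5.9 can be checked numerically in a one-parameter case. The sketch below assumes the Poisson family as an illustration, written with $\nu(\{x\}) = 1/x!$ on $\{0, 1, 2, \dots\}$, $T(x) = x$ and $\theta = \log\lambda$, so that $K(\theta) = \exp(e^\theta)$ and $j(\theta) = e^\theta$. The family, the truncation of the sums at 200 terms and the finite-difference step are all assumptions made for the example.

```python
import math

def j(theta):
    """j(theta) = log K(theta) for the illustrative Poisson family."""
    return math.exp(theta)

def moments_direct(theta, n_terms=200):
    """Mean and variance of T(x) = x under P_theta, by direct (truncated) summation."""
    # P_theta({x}) = exp(theta * x - j(theta)) / x!
    probs = [math.exp(theta * x - j(theta) - math.lgamma(x + 1)) for x in range(n_terms)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

def derivs_numeric(theta, h=1e-4):
    """Central finite-difference approximations to j'(theta) and j''(theta)."""
    d1 = (j(theta + h) - j(theta - h)) / (2 * h)
    d2 = (j(theta + h) - 2 * j(theta) + j(theta - h)) / h ** 2
    return d1, d2

theta0 = math.log(2.0)            # corresponds to lambda = 2
mean0, var0 = moments_direct(theta0)
d1_0, d2_0 = derivs_numeric(theta0)
```

Here both the mean and the variance of $T$ should agree with $j'(\theta) = j''(\theta) = e^\theta$, up to truncation and finite-difference error.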

Any convex set in $\mathbb{R}^k$ either has non-empty interior or is included in some lower-dimensional
affine subspace (RAP, 6.2.6). So for an exponential family of order $k$, the
natural parameter space in a minimal representation, and so in $\mathbb{R}^k$, has non-empty interior.
(Here $k \geq 1$ since by definition an exponential family contains at least two laws.) So for
an exponential family in a minimal representation on its natural parameter space the
hypothesis of the following theorem holds:

2.5.10 Theorem. If for an exponential family (2.5.3), $\Theta$ includes a non-empty open set
$U$, then $T$ is a Lehmann-Scheffé statistic.

Proof. As noted above, $T = (T_1, \dots, T_k)$ is sufficient. Let $p = (p_1, \dots, p_k) \in U$. Replacing
$d\nu$ by $\exp(\sum_j p_jT_j(x))\,d\nu$, we can assume that $p = 0$. For some $\varepsilon > 0$, $U$ includes the
cube $C_\varepsilon := \{\theta : |\theta_j| < \varepsilon \text{ for } j = 1, \dots, k\}$.

Recall that a function measurable for $T^{-1}(\mathcal{F})$, where in this case $\mathcal{F}$ is the Borel $\sigma$-algebra
in $\mathbb{R}^k$, is of the form $f \circ T$ for some measurable $f$ (Theorem 2.1.3). Let $f : \mathbb{R}^k \to \mathbb{R}$
be such that $E_\theta f(T(x)) = 0$ for all $\theta \in \Theta$. Let $f = f^+ - f^-$ where $f^+ := \max(f, 0)$ and
$f^- := -\min(f, 0)$, so that $f^+ \geq 0$ and $f^- \geq 0$. Let $m := \nu \circ T^{-1}$ on $\mathbb{R}^k$. Then for
all $\theta \in C_\varepsilon$,

$u(\theta) := \int \exp\big(\sum_j \theta_jt_j\big)f^+(t)\,dm(t) = v(\theta) := \int \exp\big(\sum_j \theta_jt_j\big)f^-(t)\,dm(t).$

From $u(0) = v(0)$ we see that $\int f^+\,dm = \int f^-\,dm$. Multiplying $f$ by a constant, we
can assume $\int f^+\,dm = 1$ (if it's zero, we are done). Then letting $dP^+ := f^+\,dm$ and
$dP^- := f^-\,dm$, $P^+$ and $P^-$ are probability measures.

For complex $\theta_j = \xi_j + i\eta_j$, in the strip $S : |\xi_j| < \varepsilon$, $j = 1, \dots, k$, the above
integrals $u(\theta)$ and $v(\theta)$ converge absolutely. By Theorem 2.5.8, they represent holomorphic
(analytic) functions of $\theta$ in $S$. Since $u = v$ for $\theta$ real (all $\eta_j = 0$), $u$ and $v$ have the same
derivatives of all orders at 0. So $u = v$ in a complex neighborhood of 0, and by analytic
continuation (e.g. Bochner and Martin, 1948, p. 34), $u = v$ throughout $S$. Taking $\xi_j = 0$
for all $j$, we see that $P^+$ and $P^-$ on $\mathbb{R}^k$ have the same characteristic function. So $P^+ = P^-$
by the uniqueness theorem (RAP, Theorem 9.5.1). (Here we have obtained a uniqueness
theorem for Laplace transforms in a neighborhood of 0.) So $f^+ = f^-$ and $f = 0$ almost
everywhere for $m$, that is, $f \circ T = 0$ almost everywhere for $\nu$, proving the Lehmann-Scheffé property. □
Theorem 2.4.15 shows that the lower bound in the information inequality is attained,
for densities continuously differentiable in $\theta$, if and only if the family of laws is exponential
as in (2.5.1) where the estimator $T$ is an affine (linear) function of $T_1$, $T = aT_1 + b$ for some
constants $a, b$. On the other hand by Theorem 2.5.10, in such a case $T_1$ is an LS sufficient
statistic. By Corollary 2.5.9, it is an unbiased estimator of $j'(\theta)$ where $j(\theta) = \log K(\theta)$.

Let $\eta$ be any function not equal (even almost everywhere) to an affine function, such
that $E_\theta(\eta(T_1)^2) < \infty$ for all $\theta$. Then $\eta(T_1)$ is an unbiased estimator of $g(\theta) := E_\theta\eta(T_1)$
which, among unbiased estimators, has smallest possible variance and minimal risk for all
convex loss functions by Theorem 2.3.5, but where the information inequality lower bound
is not attained for some $\theta$. So the information inequality produces sharp results only in
rather special cases. It may be more useful in the form allowing bias, Theorem 2.4.12, or
in an asymptotic form, Theorem 3.8.3 below.

The existence of a $k$-dimensional sufficient statistic $T = (T_1, \dots, T_k)$ for an exponential
family extends to any sample size $n$ for $n$ i.i.d. observations, as noted previously,
replacing each $T_i$ by $\sum_{j=1}^n T_i(X_j)$. When R. A. Fisher first defined exponential families,
one of the main properties he pointed out was the possibility of data reduction in this way.
Moreover, he stated that if the data can be reduced, in other words if for i.i.d. $X_1, \dots, X_n$
there is a sufficient statistic of dimension $k < n$ (even for one value of $n$) then the family
of laws must be exponential. This is true under some regularity conditions, one of
which is that the family be equivalent. For example, the family of uniform distributions
on intervals $[0, \theta]$, $0 < \theta < \infty$, has a 1-dimensional sufficient statistic, the largest order
statistic $X_{(n)}$, but is evidently not equivalent and (so) not exponential. Other regularity
conditions of continuity and differentiability will be assumed. If there were no such conditions,
the "dimension" of a sufficient statistic would not be meaningful. For example, if
$X$ and $Y$ are any two uncountable Borel sets in complete separable metric spaces, such as
$X = \mathbb{R}^k$ and $Y = \mathbb{R}^m$, then there is always a 1-1, Borel measurable function from $X$ onto
$Y$ with measurable inverse (RAP, Sec. 13.1). Any Borel measurable function is continuous
when restricted to sets having nearly full measure (Lusin's theorem, RAP, Theorem 7.5.2).
Also, for any $m$ there is a continuous function from $\mathbb{R}^m$ into $\mathbb{R}$, 1-1 almost everywhere for
Lebesgue measure (Denny, 1964).

The following example may illustrate the point. Let $x$ and $y$ be two numbers in $[0, 1]$,
each represented by its decimal expansion, $x = \sum_{n \geq 1} x_n/10^n$ where each $x_n$ is $0, 1, \dots,$ or
9, and likewise for $y$. By alternating digits define a real number $z$ with digits $z_{2n-1} = x_n$
and $z_{2n} = y_n$ for $n = 1, 2, \dots$. This gives a correspondence between ordered pairs $(x, y)$ of
real numbers and individual real numbers $z$. Although it is not quite well-defined (because
of ambiguities such as $0.099999999\ldots = 0.100000\ldots$), 1-1 or continuous, the correspondence
illustrates a reduction of dimension (from 2 to 1) which is not a real reduction in
the sense of statistical interest. The example also shows why some regularity conditions
such as differentiability may be expected in proofs about data reduction implying that a
family is exponential.
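
The digit-alternating correspondence can be sketched on finite digit strings. The code below is only an illustration under the assumption that $x$ and $y$ are given by equally long strings of their decimal digits after the point; the function names are hypothetical.

```python
def interleave_digits(x_digits, y_digits):
    """Build the digits of z with z_{2n-1} = x_n and z_{2n} = y_n.

    x_digits and y_digits are strings of decimal digits (after the point)
    of equal length.
    """
    if len(x_digits) != len(y_digits):
        raise ValueError("digit strings must have equal length")
    return "".join(a + b for a, b in zip(x_digits, y_digits))

def split_digits(z_digits):
    """Recover the digit strings of x and y from those of z."""
    return z_digits[0::2], z_digits[1::2]

z = interleave_digits("123", "456")          # x = 0.123, y = 0.456

# The ambiguity mentioned in the text: 0.0999... and 0.1000... are the same
# real number, yet (truncated to 4 digits) they interleave quite differently.
z_nines = interleave_digits("0999", "5000")
z_zeros = interleave_digits("1000", "5000")
```

The two expansions of the same $x$ lead to values of $z$ that differ already in the first digit, which shows why the map is neither well-defined nor continuous.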

Let $\mathcal{P}$ be an equivalent family of probability measures. Let $Q$ be a fixed law in the
family. If $T$ is a sufficient statistic for $\{P^n : P \in \mathcal{P}\}$, the family of laws of $n$ i.i.d.
observations $X_1, \dots, X_n$ with laws in $\mathcal{P}$, then by Corollary 2.1.5 for each $P$ in $\mathcal{P}$ there is
a function $\rho_P$ with

(2.5.11) $\prod_{j=1}^n R_{P/Q}(x_j) = \rho_P(T(x_1, \dots, x_n))$

for almost all $x_1, \dots, x_n$. $T$ will be called strongly sufficient (with respect to given choices
of $Q$ and of $R_{P/Q}$ for all $x$ and all $P \in \mathcal{P}$) if (2.5.11) holds for all (and not only almost
all) $x$.

Let $\phi_P(x) := \log R_{P/Q}(x)$ for any $P$ in $\mathcal{P}$. We will be considering families for which
the likelihood ratios $R_{P/Q}$ are continuous non-zero functions of $x$, so that $\phi_P$ is continuous,
and where every neighborhood of each point in the sample space has positive measure for
each law in $\mathcal{P}$, so that $\phi_P$ is determined everywhere by continuity and not only almost
everywhere. So, strong sufficiency is a reasonable assumption.

A function $f$ on a region in $\mathbb{R}^k$ is called $C^1$ if it has continuous first partial derivatives
with respect to each of the $k$ variables. It will be called $BC^1$ if these derivatives are also
bounded. A real-valued function $f$ on an interval $U \subset \mathbb{R}$ will be called piecewise $BC^1$ if $f$
is continuous on $U$ and there is a finite set $F \subset U$ such that $f$ is $BC^1$ on $U \setminus F$, i.e. $f$ is
$BC^1$ on each of finitely many open intervals whose endpoints are in $F$ or are endpoints of
$U$. Now a fact can be stated:

2.5.12 Theorem. Let $\mathcal{P}$ be a family of laws defined on a connected open set $U$ in $\mathbb{R}^r$ and
having continuous densities $f_P$, $P \in \mathcal{P}$, with respect to Lebesgue measure $\lambda$ on $U$, with
$f_P(x) > 0$ for all $x \in U$ and $P$ in $\mathcal{P}$ (so $\mathcal{P}$ is equivalent). Suppose that all the functions
$f_P$ are continuous on $U$ and that for some positive integers $k < n$, there is a statistic $T$,
continuous from $U^n$ into $\mathbb{R}^k$, strongly sufficient for $\{P^n : P \in \mathcal{P}\}$, where $R_{P/Q} := f_P/f_Q$.
Then

(a) If $k = 1$, $\mathcal{P}$ is an exponential family of order 1.

(b) If all the densities $f_P$ are $BC^1$, or if $r = 1$ and they are piecewise $BC^1$, then $\mathcal{P}$ is
exponential of order at most $k$.

Proof. For a given $n$, let $S := S_T$ be the set of all continuous real functions $\phi$ such that
for some function $\zeta$,

(2.5.13) $\phi(x_1) + \cdots + \phi(x_n) = \zeta(T(x_1, \dots, x_n))$

for all $x_1, \dots, x_n$ in $U$. (Such an equation results from taking logarithms in (2.5.11) with
$\phi(x) := \phi_P(x)$.) Clearly $S$ is a vector space of functions containing the constants. Suppose
$S$ has dimension $k + 1$. Then it has a basis consisting of the constant 1 and $k$ other functions
$\phi_1, \dots, \phi_k$, and each function $\phi_P$ is of the form $a_0(P) + a_1(P)\phi_1(x) + \cdots + a_k(P)\phi_k(x)$,
so the family is exponential of order at most $k$.

Case 1: $r = 1$, so $U$ is an open interval in $\mathbb{R}$, and $k = 1$. It will be enough to prove that
the dimension of $S$ is at most 2.

Suppose the hypothesis holds for some $n > 2$. It will be shown that it holds for $n = 2$.
Let's indicate the dependence of $\rho_P$ and $T$ on $n$ in (2.5.11) by writing them as $\rho_P^{(n)}$ and
$T^{(n)}$ respectively. By our choice of $R_{P/Q} = f_P/f_Q$, we have $0 < R_{P/Q}(y) < \infty$ for all
$y \in U$. Fix any $y_3, \dots, y_n \in U$. Then (2.5.11) holds for $n = 2$ with

$T^{(2)}(x_1, x_2) := T^{(n)}(x_1, x_2, y_3, \dots, y_n)$ and $\rho_P^{(2)}(t) := \rho_P^{(n)}(t)\big/\prod_{j=3}^n R_{P/Q}(y_j)$

for any $t \in \mathbb{R}$. Clearly $T^{(2)}$ is continuous. So the hypothesis holds for $n = 2$ and it suffices
to treat that case. The following will be useful:
to treat that case. The following will be useful: 
2.5.14 Lemma. If $k = 1$, $n = 2$, $\phi \in S_T$, $x_0$, $y$ and $z$ are in $U$, $\phi(y) = \phi(z)$ and
$s := T(x_0, y) \neq t := T(x_0, z)$, then $\phi$ is constant in some neighborhood of $x_0$.

Proof. We can assume that $s < t$ and then that $y < z$, otherwise interchanging $x$ and
$-x$. Let $y_1 := \sup\{u : u < z \text{ and } T(x_0, u) = s\}$. Then by (2.5.13) and continuity of $\phi$,
$\phi(y) = \phi(y_1)$. So we can assume $y = y_1$. Likewise, we can assume $z = \inf\{u : y_1 < u \text{ and }
T(x_0, u) = t\}$. Then by continuity of $T$ and the intermediate value theorem,

(2.5.15) $s < T(x_0, u) < t$ for $y < u < z$.

Since $\phi(y) = \phi(z)$, by Rolle's theorem there is some $v$ with $y < v < z$ at which $\phi$
attains either its absolute maximum or absolute minimum on the closed interval $[y, z]$. By
symmetry, suppose it is an absolute minimum. Let $t_0 := T(x_0, v)$. Then for $\zeta$ in (2.5.13),
$\zeta(t_0) = \min\{\zeta(w) : s \leq w \leq t\}$ since $T(x_0, \cdot)$ takes $[y, z]$ onto $[s, t]$ by the intermediate
value theorem and $\phi(u) = \zeta(T(x_0, u)) - \phi(x_0)$ (continuity of $\zeta$ is not assumed or needed
here).

Now $s < t_0 < t$ by (2.5.15). For $x$ in some neighborhood $V$ of $x_0$, by continuity,
$T(x, y) < t_0 < T(x, z)$ and $s < T(x, v) < t$. By the intermediate value theorem again, for
each $x$ in $V$ there is some $g(x)$ with $T(x, g(x)) = t_0$ and $y < g(x) < z$. Then for each
$x \in V$, by (2.5.13) twice,

$\zeta(t_0) = \phi(x) + \phi(g(x)) \geq \phi(x) + \phi(v) = \zeta(T(x, v)) \geq \zeta(t_0).$

So both inequalities just above are equations. Also, $\zeta(t_0) = \zeta(T(x_0, v)) = \phi(x_0) + \phi(v)$,
so $\phi(x) = \phi(x_0)$, proving Lemma 2.5.14. □

Now let $g$ and $\gamma$ be in $S$ and suppose $g$ is not constant, so $g(y) \neq g(z)$ for some $y$ and $z$
in $U$. Then for any $x_0 \in U$, $T(x_0, y) \neq T(x_0, z)$. For some $c \in \mathbb{R}$, $\gamma(y) - cg(y) = \gamma(z) - cg(z)$.
Let $\phi := \gamma - cg$. Then $\phi \in S$ and $\phi(y) = \phi(z)$, so by Lemma 2.5.14, $\phi$ is constant on
a neighborhood of $x_0$. Since $x_0$ was arbitrary in $U$ and $U$ is connected, $\phi = b$ on $U$ for
some constant $b$, so $\gamma = cg + b$ and $S_T$ is at most 2-dimensional, finishing the proof for
Case 1.

Case 2: $r = 1$ and $k > 1$. Here we use:

2.5.16 Lemma. Let $f_1, \dots, f_n$ be piecewise $BC^1$ functions from an open interval $U$ into
$\mathbb{R}$, $x = (x_1, \dots, x_n) \in U^n$, and $G(x) := \{\sum_{j=1}^n f_i(x_j)\}_{i=1}^n$. If $1, f_1, \dots, f_n$ are linearly
independent, then for some $y \in U^n$, $\det J(y) \neq 0$ where $J$ is the Jacobian matrix of $G$,
$J_{ij} := f_i'(x_j)$ for $i, j = 1, \dots, n$.

Proof. The proof will be by induction on $n$. The result clearly holds for $n = 1$. Suppose
it holds for $n - 1$ but fails for $n$, for some $f_1, \dots, f_n$. Expanding the determinant by the
minors of the last column gives

(2.5.17) $0 = a_1(x_1, \dots, x_{n-1})f_1'(x_n) + \cdots + a_n(x_1, \dots, x_{n-1})f_n'(x_n)$

for all $x \in U^n$ with $x_n$ not in the finite union of finite sets where the $f_i'$ don't exist. Here
$a_n(x_1, \dots, x_{n-1})$ is the determinant for the $n - 1$ case and $f_1, \dots, f_{n-1}$, so for some
$z \in U^n$, $a_n(z_1, \dots, z_{n-1}) \neq 0$. Let $x_i = z_i$ for $i = 1, \dots, n - 1$ and integrate (2.5.17)
with respect to $x_n$ over an interval, say from (a constant) $b$ to (a variable) $x$, so

$0 = \sum_{j=1}^n a_j(z_1, \dots, z_{n-1})(f_j(x) - f_j(b)).$

Since $a_n(z_1, \dots, z_{n-1}) \neq 0$, this contradicts the linear independence of $1, f_1, \dots, f_n$, proving
Lemma 2.5.16. □
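
A tiny instance of Lemma 2.5.16 can be computed by hand or by machine. The sketch below assumes $n = 2$ with the illustrative choices $f_1(x) = x$ and $f_2(x) = x^2$ on $U = (0, 1)$ (so $1, f_1, f_2$ are linearly independent); these functions and the helper names are hypothetical. Then $G(x_1, x_2) = (x_1 + x_2,\ x_1^2 + x_2^2)$ and $\det J = 2(x_2 - x_1)$, which is nonzero off the diagonal.

```python
def jacobian(derivatives, point):
    """Matrix J with J[i][j] = f_i'(x_j), for a list of derivative functions."""
    return [[df(xj) for xj in point] for df in derivatives]

def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Derivatives of the illustrative functions f_1(x) = x and f_2(x) = x**2.
f_primes = [lambda x: 1.0, lambda x: 2.0 * x]

det_off_diagonal = det2(jacobian(f_primes, (0.25, 0.75)))   # 2 * (0.75 - 0.25)
det_on_diagonal = det2(jacobian(f_primes, (0.5, 0.5)))      # vanishes when x_1 = x_2
```

The determinant can vanish at particular points, as on the diagonal here; the lemma only asserts that it is nonzero at some point of $U^n$.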

Now to prove Case 2, it is enough to show that the dimension of the space $S_1$ of $C^1$
functions in $S$ is at most $k + 1$. If not, let $1, f_1, \dots, f_n$ be linearly independent functions in
$S_1$ with $n = k + 1$. The hypotheses hold for $n = k + 1$ since they hold for some $n > k$, as in
Case 1. By Lemma 2.5.16 there is a point $x \in U^n$ with $\det J(x) \neq 0$. Then by the inverse
function theorem, e.g. Rudin (1976, Theorem 9.24), $G$ is 1-1 on some neighborhood of $x$.
But also, $G = H(T)$ where $T = (T_1, \dots, T_k)$ and $H(T) = (\psi_1(T), \dots, \psi_n(T))$ for some
functions $\psi_i$, so $T$ must be 1-1 on the neighborhood. But this is impossible since $T$ is
continuous and reduces the dimension (see Appendix B), finishing the proof in Case 2.

Case 3: $r > 1$. Consider part (b). To show that the dimension of $S_1$ is at most $k + 1$,
since it contains the constants, is equivalent to showing that for any $\{f_1, \dots, f_{k+1}\} \subset S_1$,
the range of the function $x \mapsto (f_1(x), \dots, f_{k+1}(x))$ is included in some $k$-hyperplane (i.e.
$k$-dimensional hyperplane) in $\mathbb{R}^{k+1}$, in other words $1, f_1, \dots, f_{k+1}$ are linearly dependent.

Otherwise, there exist such $f_i$ and points $x_1, \dots, x_{k+2}$ in $U$ such that the points
$\{f_j(x_i)\}_{j=1}^{k+1}$ for $i = 1, \dots, k + 2$ are not all in any $k$-hyperplane in $\mathbb{R}^{k+1}$. To see this, we
can recursively select $x_1, x_2, \dots, x_{k+2}$ such that $x_2 \neq x_1$ and for $j \geq 3$, $x_j$ is not in the
unique $(j - 2)$-dimensional hyperplane containing $x_1, \dots, x_{j-1}$. For each $j$, since $f_j \in S_1$,
take $\psi_j$ such that $f_j(y_1) + \cdots + f_j(y_n) = \psi_j(T(y_1, \dots, y_n))$. Let $\gamma$ be a $BC^1$ curve, i.e.
a $BC^1$ function from an open interval $I_1 := (a, b)$ into $U$, whose range contains all the
points $x_1, \dots, x_{k+2}$. Such a $\gamma$ exists since $U$ is open and connected. In more detail, for
each $u \in U$ the open ball $B(u, r) \subset U$ for some $r > 0$ where $B(u, r) := \{y : |y - u| < r\}$.
Let $W(u)$ be the set of all $v$ such that for some $n < \infty$ there exist $u_0 = u, u_1, \dots, u_n = v$ and
$r_i > 0$, $i = 0, 1, \dots, n$ such that $B(u_i, r_i) \subset U$ for each $i$ and $B(u_i, r_i) \cap B(u_{i-1}, r_{i-1}) \neq \emptyset$
for each $i = 1, \dots, n$. It is easily seen that $W(u)$ is an open set included in $U$ and that two
sets $W(u)$ and $W(u')$, if they intersect, are identical. Thus by connectedness, $W(u) = U$
for each $u \in U$. Now if $v \in W(u)$ it is easy to construct a $BC^1$ curve within $U$ joining $u$
to $v$.

Let

$V(t_1, \dots, t_{k+1}) := T(\gamma(t_1), \dots, \gamma(t_{k+1}))$

for any $t_i$ in $I_1$. Then $V$ is continuous. Let $g_i(t) := f_i(\gamma(t))$ for any $t$ in $I_1$. Then

$g_i(t_1) + \cdots + g_i(t_{k+1}) = \psi_i(V(t_1, \dots, t_{k+1}))$

for any $t_1, \dots, t_{k+1} \in I_1$ and $i = 1, \dots, k + 1$. Since $f_i$ and $\gamma$ are $BC^1$ functions, so are
the $g_i$, and they belong to the space $S_1$ for $r = 1$, $I_1$ in place of $U$, and $V$ in place of $T$.
But the range of $t \mapsto (g_1(t), \dots, g_{k+1}(t))$ for $t \in I_1$ is not included in any $k$-hyperplane,
contradicting Case 2. If $k = 1$, then the $f_i$ need not be $BC^1$ and Case 1 is applied instead,
finishing the proof of Theorem 2.5.12. □
Example. This will show why the connectedness of $U$ is, or the continuity hypotheses
are, needed in Theorem 2.5.12. Let $U := (0, 1) \cup (2, 3)$ (which is not connected). Let
the dominating measure $\nu$ be the sum of Lebesgue measures on the two intervals. For
$0 < \lambda < 1$, $0 < \theta < \infty$ let

$\phi_{\theta,\lambda}(x) := \lambda\theta e^{\theta x}/(e^\theta - 1)$, $\quad 0 < x < 1$;

$\phantom{\phi_{\theta,\lambda}(x)} := (1 - \lambda)\theta e^{\theta(x-2)}/(e^\theta - 1)$, $\quad 2 < x < 3$.

It is straightforward to check that this is a probability density for each $\theta$ and $\lambda$. Let $x_1, x_2$
be i.i.d. with this density. Then the likelihood function is

$u(\theta, \lambda, x_1, x_2)\,\theta^2\exp(\theta(x_1 + x_2))/(e^\theta - 1)^2$

where $u(\theta, \lambda, x_1, x_2) := \lambda^2$ for $0 < x_1 + x_2 < 2$, $2\lambda(1 - \lambda)e^{-2\theta}$ for $2 < x_1 + x_2 < 4$, and
$(1 - \lambda)^2e^{-4\theta}$ for $4 < x_1 + x_2 < 6$. It follows by Corollary 2.1.5, not only because of the
factor $\exp(\theta(x_1 + x_2))$ but because the ranges for different formulas for $u(\cdot, \cdot, x_1, x_2)$ also
are functions of $x_1 + x_2$, that $x_1 + x_2$ is a $k = 1$-dimensional sufficient statistic for the
family with $n = 2$. Let $\gamma(\theta, \lambda) := \log[\theta/(e^\theta - 1)]$. Then one can check that

$\log\phi_{\theta,\lambda}(x) = \gamma(\theta, \lambda) + \log\lambda + \theta x + [\log((1 - \lambda)/\lambda) - 2\theta]\,1_{2<x<3}.$

Since the functions $x$ and $1_{2<x<3}$ are affinely independent, as are the functions $\theta$ and
$\log((1 - \lambda)/\lambda)$, we see that the family is exponential of order 2, not 1. Thus the conclusion
of Theorem 2.5.12(a) does not hold in this case. The connectedness of the interval $U$ is
used in the proof more than once, by way of the intermediate value theorem.
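
The two claims of this example (the order-2 decomposition of the log density, and the sufficiency of $x_1 + x_2$ for $n = 2$) can be checked numerically. The sketch below is only an illustration; the function names, the two sample pairs and the parameter values are all hypothetical choices.

```python
import math

def density(x, theta, lam):
    """The two-interval density of the example, on (0, 1) and (2, 3)."""
    norm = theta / (math.exp(theta) - 1.0)
    if 0 < x < 1:
        return lam * norm * math.exp(theta * x)
    if 2 < x < 3:
        return (1 - lam) * norm * math.exp(theta * (x - 2))
    return 0.0

def log_density_decomposed(x, theta, lam):
    """gamma + log(lam) + theta * x + [log((1 - lam) / lam) - 2 * theta] * 1_{2<x<3}."""
    gamma = math.log(theta / (math.exp(theta) - 1.0))
    indicator = 1.0 if 2 < x < 3 else 0.0
    return (gamma + math.log(lam) + theta * x
            + (math.log((1 - lam) / lam) - 2 * theta) * indicator)

def log_lik(pair, theta, lam):
    """Log-likelihood of two i.i.d. observations."""
    return sum(math.log(density(x, theta, lam)) for x in pair)

# Two hypothetical samples with the same sum x_1 + x_2 = 3.
pair_a = (0.5, 2.5)
pair_b = (0.2, 2.8)
params = [(0.5, 0.3), (1.5, 0.4), (3.0, 0.9)]

# If x_1 + x_2 is sufficient, this difference should not depend on (theta, lam).
diffs = [log_lik(pair_a, th, la) - log_lik(pair_b, th, la) for th, la in params]
```

A sum in $(2, 4)$ forces one observation into each interval, which is why the value of $x_1 + x_2$ alone fixes which formula for $u$ applies.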

Of course, connectedness is only meaningful in connection with continuity of some
functions. We could take $U := (0, 2)$ to be connected in the example while $\phi_{\theta,\lambda}$ and $T$
are discontinuous by replacing $(2, 3)$ by $[1, 2)$ and letting $x_1(x) = x$ for $0 < x < 1$ and
$x_1(x) = x + 1$ for $1 \leq x < 2$, while taking $x_2$ as an i.i.d. copy of $x_1$.

Or, replacing the union of two intervals by a union of as many intervals as we like,
we can get the exponential family to be of arbitrarily high order for $n = 2$. Similarly, by
spreading the intervals farther apart, for example taking $(0, 1) \cup (n, n + 1) \cup (n^2 + n, n^2 + n + 1)$,
we can get a 1-dimensional sufficient statistic for any number $n$ of i.i.d. observations,
again if $U$ is not connected or the densities and $T$ are not continuous.

Note that for $r > 1$, the dimension of the full data vector $(X_1, \dots, X_n)$ is $nr$, but
that the assumption in Theorem 2.5.12 is $k < n$ (and not $k < nr$). Suppose we consider
a family of distributions on $\mathbb{R}^r$ having densities with respect to some measure (not Lebesgue
measure) which are functions of the first coordinate $x_1$. Then $x_1$ is a sufficient statistic.
For $n$ i.i.d. variables there is an $n$-dimensional sufficient statistic and $n < nr$, but the
family need not be exponential. So the assumption $k < n$ in Theorem 2.5.12 is sharp.

Example. For a normal distribution $N(m, \sigma^2)$ on $\mathbb{R}$, we can write the density as

$(2\pi\sigma^2)^{-1/2}\exp\Big(-\frac{x^2}{2\sigma^2} + \frac{mx}{\sigma^2} - \frac{m^2}{2\sigma^2}\Big).$

These laws form an exponential family with $T_1(x) = x^2$, $\theta_1(m, \sigma^2) = -1/(2\sigma^2)$, $T_2(x) = x$,
and $\theta_2(m, \sigma^2) = m/\sigma^2$. For $N(m, \sigma^2)^n$ on $\mathbb{R}^n$, we get $T_1(x) = \sum_{j=1}^n x_j^2$ and
$T_2(x) = \sum_{j=1}^n x_j$.

2.5.18 Proposition. Let $X_1, \dots, X_n$ be i.i.d. with law $N(m, \sigma^2)$ and $n \geq 2$. Then
$s^2 := (n - 1)^{-1}\sum_{j=1}^n (X_j - \bar{X})^2$ has, among unbiased estimators of $\sigma^2$, smallest risk for
squared-error loss, for all $(m, \sigma^2)$. The risk of $s^2$ is $2\sigma^4/(n - 1)$.

Proof. We know that $s^2$ is an unbiased estimator of $\sigma^2$. Let $\mathcal{S}$ be the smallest $\sigma$-algebra
for which $T_1$ and $T_2$ are measurable in this case. Then $\mathcal{S}$ is sufficient. It is Lehmann-Scheffé
by Theorem 2.5.10. Since $s^2 = (n - 1)^{-1}(T_1 - T_2^2/n)$, $s^2$ is $\mathcal{S}$-measurable. Then it follows
from Theorem 2.3.5 that $s^2$ has minimum risk for squared-error loss (which is convex).

To find the variance of $s^2$, first note that if $X$ has distribution $N(0, \sigma^2)$, then $EX^4 =
3\sigma^4$, by integration by parts or the moment generating function. Also, $Es^2 = \sigma^2$ and
we can assume $m = 0$. Make an orthogonal change of coordinates from $(X_1, \dots, X_n)$
to $(Y_1, \dots, Y_n)$ where the $Y_1$ axis is in the direction of $(1, 1, \dots, 1)$, so that $Y_1 = n^{1/2}\bar{X}$.
Then the $Y_j$ are i.i.d. $N(0, \sigma^2)$ and $s^2 = (n - 1)^{-1}\sum_{j=2}^n Y_j^2$. So

$E((s^2)^2) = (n - 1)^{-2}\sigma^4[3(n - 1) + (n - 1)(n - 2)] = (n + 1)\sigma^4/(n - 1),$

and $\mathrm{var}(s^2) = 2\sigma^4/(n - 1)$. □
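
Two steps of this proof are easy to check with exact rational arithmetic: the identity $s^2 = (n - 1)^{-1}(T_1 - T_2^2/n)$ and the algebra leading from $E((s^2)^2)$ to the variance. The sketch below is only an illustration; the sample values and function names are hypothetical.

```python
from fractions import Fraction

def s2_direct(xs):
    """Sample variance (n - 1)^{-1} * sum (x_j - xbar)^2."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def s2_from_T(xs):
    """The same quantity computed from T_1 = sum x_j^2 and T_2 = sum x_j."""
    n = len(xs)
    t1 = sum(x * x for x in xs)
    t2 = sum(xs)
    return (t1 - t2 * t2 / n) / (n - 1)

def var_s2_over_sigma4(n):
    """var(s^2) / sigma^4 from E((s^2)^2) = (n-1)^{-2} sigma^4 [3(n-1) + (n-1)(n-2)]."""
    n = Fraction(n)
    second_moment = (3 * (n - 1) + (n - 1) * (n - 2)) / (n - 1) ** 2
    return second_moment - 1          # subtract (E s^2)^2 / sigma^4 = 1

sample = [Fraction(1), Fraction(2), Fraction(4)]
```

For this sample both formulas give $7/3$, and for every $n \geq 2$ the last function returns $2/(n - 1)$.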

Recall that for an unbiased estimator, the risk for squared-error loss is the same as the
variance. Proposition 2.5.18 completes example (3) following Theorem 2.4.10 and shows
that the information inequality lower bound (2.4.3) cannot be attained in this case.

Unbiased estimators which are LS sufficient statistics are thereby optimal among unbiased
estimators, but may fail in other ways. In Sec. 2.2, just after the definition of
unbiased estimator, an example was given of an inadmissible unbiased estimator for $e^{-\lambda}$
where $\lambda$ is the parameter of a Poisson random variable $X$ which is observed conditional on
$X \geq 1$. Now, note that Poisson distributions conditional on $X \geq 1$ form an exponential
family of order 1. So the pathology of unbiased estimators occurs even in this case where
we have an LS sufficient one-dimensional statistic. We had also noted that a constant
estimator may be admissible though biased.

From what has been said so far it might seem that an estimator which is both unbiased 
and admissible, especially for an exponential family of order 1, might be a good estimator. 
But it may not be: consider the binomial distributions for $n = 2$ (for the number $X$ of
successes in 2 independent trials) and probability $p$ of success, where $0 < p < 1$. Suppose
the problem is to estimate $p^2$. It's easily seen that the unique unbiased estimator $U$
measurable for the minimal sufficient $\sigma$-algebra is $U(0) = U(1) = 0$ (here $U(1) = 0$ is
surprising and bad) and $U(2) = 1$. Now, it will be shown that $U$ is admissible for a wide
class of loss functions. If the loss is a continuous function $f$ of the difference $T - p^2$ for an
estimate $T$, with $f \ge 0$ and $f(x) = 0$ if and only if $x = 0$, the risk for a given $p$ is

$$r(p,T) = (1-p)^2 f(T(0)-p^2) + 2p(1-p) f(T(1)-p^2) + p^2 f(T(2)-p^2),$$

so that $r(p,U) = (1-p)^2 f(-p^2) + 2p(1-p) f(-p^2) + p^2 f(1-p^2)$. If for some $s > 1/2$,
$f(x) = O(|x|^s)$ as $|x| \to 0$, which holds if $f$ is convex for $s = 1$, and where we can assume
$s \le 1$, then $r(p,U) = O(p^{2s})$ as $p \downarrow 0$. If $T$ is an estimator with $r(p,T) \le r(p,U)$ for all
$p$, then taking $p \downarrow 0$, and noting that $2s > 1$, it is easily seen that $T(0) = 0$ and then that
$T(1) = 0$. Then letting $p \uparrow 1$ shows that $T(2) = 1$. So $T = U$ and $U$ is admissible.
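
The binomial example can be made concrete with a short computation. The sketch below is an illustration under stated assumptions, not part of the original argument: it uses squared-error loss $f(d) = d^2$ as one instance of the class of losses, and compares $U$ with a hypothetical alternative estimator $T(x) = (x/2)^2$ that does not appear in the text. It checks unbiasedness of $U$ exactly and prints both risks at a few values of $p$.

```python
from fractions import Fraction

def binom2_pmf(p):
    """Probabilities of X = 0, 1, 2 successes in 2 independent trials."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

def mean(est, p):
    """Expected value of the estimator est = [est(0), est(1), est(2)]."""
    return sum(w * t for w, t in zip(binom2_pmf(p), est))

def risk(est, p, loss=lambda d: d * d):
    """Risk r(p, est) for a loss that is a function of est - p^2 (default: squared error)."""
    return sum(w * loss(t - p * p) for w, t in zip(binom2_pmf(p), est))

U = [Fraction(0), Fraction(0), Fraction(1)]     # the unbiased estimator of p^2
T = [Fraction(0), Fraction(1, 4), Fraction(1)]  # hypothetical alternative (x/2)^2

for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    print(p, mean(U, p) == p * p, risk(U, p), risk(T, p))
```

For this loss the biased alternative has smaller risk at $p = 1/2$ while $U$ has smaller risk at $p = 1/10$, so neither dominates the other, which is consistent with the admissibility of $U$.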

Thus the two "good" properties of being admissible (for a great many loss functions) 
and unbiased, even in combination, and for an exponential family of order 1, still allow 
the absurd inference that the probability of success is 0 when 1 success is observed in 2
trials. Moreover, there do not seem to be theorems providing in adequate generality that 
there are admissible and/or unbiased estimators which behave well. So in the search for 
really good estimators we will need to consider other properties, such as those in the next 
chapter. 

PROBLEMS 

In problems 1-4, show that the given family $\mathcal{P}$ of laws is exponential. Specifically,
find a $\sigma$-finite measure $\mu$ such that the densities of laws in $\mathcal{P}$ with respect to $\mu$ are of the
form $C(\theta)h(x)\exp(\sum_j \theta_j T_j(x))$ as in (2.5.1). Find the order of the family and a minimal
representation. Give $T_j$, $h$, and $C(\theta)$ explicitly. Then find the natural parameter space
(largest possible set of $\{\theta_j\}$), and indicate what functions the $\theta_j$ are of the usual parameters.

1. Let $F$ be a finite set with $k$ elements and $\mathcal{P}$ the family of all laws $P$ on $F$ which are
not 0 at any point.

2. Let $\mathcal{P}$ be the family of binomial distributions $B(n,p)$, $0 < p < 1$, for any fixed $n$.

3. (a) $\mathcal{P}$ is the family of geometric distributions $P(k) = (1-p)^{k-1}p$ for $k = 1, 2, \ldots$, with
$0 < p < 1$.

(b) $\mathcal{P}$ is the family of Poisson distributions $P_\lambda(k) := e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, \ldots$, with
$0 < \lambda < \infty$.

4. Let $F$ be the so-called "extreme value" distribution function $F(x) := \exp(-e^{-x})$ for
$-\infty < x < \infty$. Let its density be $f(x)$ and consider the location family of all laws with
densities $f(x-\theta)$, $\theta \in \mathbb{R}$.



5. Show that the family of all normal laws $N(\mu, \sigma^2)$ on $\mathbb{R}$ is exponential of order 2 and has
a 1-dimensional continuous strongly sufficient statistic (for $n = 1$). Show that the family
of distributions of i.i.d. normal $(X_1, \ldots, X_n)$ has a 2-dimensional continuous strongly
sufficient statistic for each $n \ge 2$.

6. Let $Q_\psi$ be distributions on $\mathbb{R}^2$ such that $X$ has distribution $N(0, \tau^2)$ and, given $X$, $Y$
has distribution $N(a + bX, \sigma^2)$, where $\psi = (\sigma^2, \tau^2, a, b)$. Find a minimal representation
with functions $\theta_i(\psi)$. Describe the family as a subset of the family given by the natural
parameter space.

7. Same question with a + bX replaced by a + bX + cX 2 . 

8. Find the natural parameter space for the exponential family on $\mathbb{R}$ with $k = 1$, $T_1(x) =
x$, and

(a) $d\mu(x) = e^{-|x|}dx$ (b) $d\mu(x) = e^{-|x|}dx/(1 + x^2)$.

NOTES 
Fisher (1934) began the theory of exponential families. An important book on the
topic is Barndorff-Nielsen (1978), who on p. 136 reviews the history of the subject, noting
that characteristically Fisher was "mathematically somewhat imprecise." Brown (1986) is
a more recent monograph.

Let's say there is "data reduction" for a family of laws if for n i.i.d. observations 
there is a sufficient statistic of dimension less than n. From the beginning, one of the 
main properties of interest for exponential families was to be equivalent families allow- 
ing data reduction. Other early references are Darmois (1935), Koopman (1936) and 
Pitman (1936). Their names have sometimes been used for exponential families, as 
in "Koopman-Darmois," "Darmois-Koopman," "Koopman-Pitman-Darmois" or "Fisher- 
Darmois-Koopman-Pitman" families. 

The current form of Theorem 2.5.12, that under some continuity conditions the
possibility of data reduction implies that a family is exponential, is due for r = 1 to Brown
(1964) with earlier work by Dynkin (1951), and for r > 1 to Barndorff-Nielsen and
Pedersen (1968), which was the source of the proof given for Theorem 2.5.12. Interestingly,
the precise statement (let alone the proof) is not given, although it is cited, in the book
of Barndorff-Nielsen (1978), and Brown (1986) doesn't cite Brown (1964) or
Barndorff-Nielsen and Pedersen (1968).